Milestone 1

1. Data Acquisition

Data acquisition is the first step of any data science pipeline, and one of the most important. For this project it works as follows:

  • Data Source: The NHL API - https://statsapi.web.nhl.com/api/v1/
  • Data Format: JSON
  • Data Retrieval: Using the requests library, I send a GET request to the NHL API, which returns a JSON object containing the requested data. The data is then parsed and stored in a Pandas DataFrame in step 3.
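
As a rough illustration of the retrieval step, the sketch below requests one game and returns the parsed JSON. The endpoint path game/<id>/feed/live and the names game_feed_url and fetch_game are assumptions made for this example, not part of the project code. The get parameter exists only so the function can be exercised without network access.

```python
from typing import Any, Callable, Optional

BASE_URL = "https://statsapi.web.nhl.com/api/v1/"


def game_feed_url(game_id: str) -> str:
    """Build the URL for one game (the 'game/<id>/feed/live' path is an assumption)."""
    return f"{BASE_URL}game/{game_id}/feed/live"


def fetch_game(game_id: str, get: Optional[Callable[..., Any]] = None) -> Optional[dict]:
    """Return the parsed JSON for one game, or None if the request fails."""
    if get is None:
        import requests  # third-party library named in the list above
        get = requests.get
    response = get(game_feed_url(game_id), timeout=10)
    if response.status_code != 200:
        # Treat any non-200 answer (for example an ID with no game) as "no data"
        return None
    return response.json()
```

Calling fetch_game('2016020001') would then return a dictionary, or None when the API does not answer with status 200.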

1.1. Data Retrieval

The following code snippet shows the first part of the retrieval process. The function game_id_generator() builds a list of candidate game IDs for a given season. The class DataDownloader() (Section 1.2) then retrieves the data for that season and stores it as JSON.

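def game_id_generator(year: int) -> list[str]:
    """Return candidate game IDs (regular season and playoffs) for a season."""
    year = str(year)
    # Number of regular-season games: 1230 for 2016, 1271 otherwise
    total_games = 1230 if year == '2016' else 1271
    ids = []
    # Regular season: <year> + '02' + zero-padded four-digit game number
    for j in range(1, total_games + 1):
        ids.append(f'{year}02{j:04d}')

    # Playoffs: <year> + '03' + '0' + round (i) + matchup (j) + game (k).
    # These ranges over-generate, so many IDs do not match a game that was
    # actually played and have to be skipped when downloading.
    for i in range(1, 10):
        for j in range(1, 10):
            for k in range(1, 8):
                ids.append(f'{year}030{i}{j}{k}')
    return ids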

1.2. Data Storage

The file data_downloader.py contains the DataDownloader class, which retrieves the data for a given season and stores it in a JSON file. Its key methods are:

  • __init__(self, path: str | None, rewrite: bool = False, threaded: bool = True, worker: int = 10, logger_path: str | None = None, log_level: int | None = logging.INFO): The constructor of the class. It takes the path to the directory where the data will be stored, a boolean indicating whether to rewrite the data if it already exists, a boolean indicating whether to use multithreading, the number of threads to use, the path to the log file, and the log level. The defaults are False, True, 10, None, and logging.INFO respectively; path has no default in the signature shown here, so it must be passed explicitly (it may be None).
  • download(self, year: int) -> None: This method is responsible for downloading the data for a given year. It takes the year as an argument and returns None.

A major feature of the DataDownloader class is that it can download the data in parallel. This is achieved by using the threading module.
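
To make the parallel download concrete, here is a minimal sketch of the idea. It is not the actual DataDownloader implementation. It uses concurrent.futures.ThreadPoolExecutor (a standard-library wrapper around threads) instead of the threading module directly, and the names save_game and download_all, the one-file-per-game layout, and the fetch callable are all assumptions made for this example.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, Optional


def save_game(game_id: str, fetch: Callable[[str], Optional[dict]],
              out_dir: Path, rewrite: bool = False) -> bool:
    """Download one game and write it to <out_dir>/<game_id>.json.

    Returns True only when a file was written.
    """
    target = out_dir / f"{game_id}.json"
    if target.exists() and not rewrite:
        return False  # already downloaded, nothing to do
    data = fetch(game_id)
    if data is None:
        return False  # no such game (or the request failed)
    target.write_text(json.dumps(data))
    return True


def download_all(game_ids: Iterable[str], fetch: Callable[[str], Optional[dict]],
                 out_dir: str, workers: int = 10, rewrite: bool = False) -> int:
    """Download many games with a pool of worker threads; return files written."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda gid: save_game(gid, fetch, out_path, rewrite),
                           game_ids)
        return sum(results)
```

Threads suit this job because each download spends most of its time waiting on the network, so several requests can be in flight at once.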

2. Interactive Debugging Tool

[Screenshot: the interactive debugging tool]

The screenshot displays some information about a specific game and an event within it, all of which can be dynamically configured using four interactive widgets. Below is the implementation of the tool.

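import json

import ipywidgets as widgets
import matplotlib.pyplot as plt
from IPython.display import display
from ipywidgets import interactive

# getFiles(), read_data() and rink_image_np are project helpers defined elsewhere.
# Assumed mapping from the dropdown label to the game-type digits of the game ID
# ('02' = regular season, '03' = playoffs, as in game_id_generator()).
game_type_digits = {'Regular': '02', 'Playoffs': '03'}

files = getFiles('201602')
data = read_data(files[0])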

# Initialize widgets
seasons = widgets.Dropdown(
    options=['2016', '2017', '2018', '2019', '2020'], description='Season:')
game_type = widgets.Dropdown(
    options=['Regular', 'Playoffs'], description='Game Type:')
game_id_slider = widgets.IntSlider(
    min=1, max=len(files), step=1, description='Game ID:')
event_slider = widgets.IntSlider(min=1, max=len(
    data['liveData']['plays']['allPlays']), step=1, description='Event:')


def update_game_id_slider(*args):
    global files
    global data
    files = getFiles(f'{seasons.value}{game_type_digits[game_type.value]}')
    game_id_slider.value = 1
    game_id_slider.max = len(files)

    update_event_slider()


def update_event_slider(*args):
    global data
    global files
    data = read_data(files[game_id_slider.value-1])

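    event_count = len(data['liveData']['plays']['allPlays'])
    if event_count:
        # Events are shown 1-based, so the slider runs from 1 to event_count
        event_slider.max = event_count
        event_slider.value = 1
        event_slider.min = 1
    else:
        # No events in this game: collapse the slider to the single value 0.
        # Lower min first so that min never exceeds max.
        event_slider.min = 0
        event_slider.max = 0
        event_slider.value = 0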


def update_event_plot(season, game_type, game_id, event_index):
    events = data['liveData']['plays']['allPlays']
    if (not events):
        print('No event')
        return

    print("gameId: ", data['gamePk'])
    home = data['liveData']['linescore']['teams']['home']['team']['abbreviation']
    away = data['liveData']['linescore']['teams']['away']['team']['abbreviation']
    print(f'{home} vs. {away}')

    event_data = events[event_index-1]

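    coordinates = event_data['coordinates']
    if not coordinates:
        # Events without coordinates cannot be plotted; dump the raw event instead
        print(json.dumps(event_data, indent=4))
        return

    period = event_data['about']['period']
    t = [i for i in data['liveData']['linescore']['periods']
         if i['num'] == period]
    # Default to the home team on the right when the rink side is unknown
    # (no matching period entry or no 'rinkSide' key); this default is an
    # assumption that avoids a NameError further down.
    isHomeOnRight = 1
    if t and t[0]['home'].get('rinkSide') == 'left':
        isHomeOnRight = -1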

    summary = f"Event: {event_data['result']['event']}\nPeriod: {event_data['about']['period']}\nTime: {event_data['about']['periodTime']}\nTeam: {event_data['team']['name']}"

    print(summary)
    plt.title(event_data['result']['description'], y=1.1)

    plt.imshow(rink_image_np, extent=[-100, 100, -42.5, 42.5])
    plt.ylim(-42.5, 42.5)
    plt.xlim(-100, 100)
    plt.xticks([-100.0, -75.0, -50.0, -25.0, 0.0, 25.0, 50.0, 75.0, 100.0])
    plt.yticks([-42.5, -21.25, 0, 21.25, 42.5])
    plt.scatter(coordinates['x'], coordinates['y'])
    plt.text(isHomeOnRight*(-75), 47, away, ha='center',
             va='center', fontsize=12)
    plt.text(isHomeOnRight*(75), 47, home, ha='center',
             va='center', fontsize=12)
    plt.xlabel("Feet")
    plt.ylabel("Feet")

    plt.show()


seasons.observe(update_game_id_slider, 'value')
game_type.observe(update_game_id_slider, 'value')
game_id_slider.observe(update_event_slider, 'value')

# Create interactive plot
interactive_plot = interactive(
    update_event_plot, season=seasons, game_type=game_type, game_id=game_id_slider, event_index=event_slider)
output = interactive_plot.children[-1]
output.layout.height = '450px'

display(interactive_plot)

3. Tidy Data

3.1. A small snippet of your final dataframe

[Figure: first 10 rows of the tidied dataframe]

3.2. Adding the actual strength information to both shots and goals

Assume each penalty event provides a start time (X), a duration (T), and the penalized team (A). For any event occurring between X and X + T, team (A) has at least one fewer player on the ice than it had at the last event before time (X). The same rule applies to penalties against the opposing team. We will maintain a record of the number of players on each team from the start of the game (typically 5-5) until its conclusion. Consequently, we can deduce the on-ice strength for every shot and goal between X and X + T from the player counts of the team executing the event and of its opponent.
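
The rule above can be sketched as follows. This is a simplified illustration, not the project's implementation: the names Penalty, skaters_on_ice and strength are hypothetical, times are assumed to be seconds since the start of the game, and the sketch ignores minors that end early on a power-play goal, coincidental penalties, overtime formats and pulled goalies.

```python
from typing import Iterable, NamedTuple


class Penalty(NamedTuple):
    start: int     # game time in seconds at which the penalty begins (X)
    duration: int  # length of the penalty in seconds (T)
    team: str      # penalized team (A)


def skaters_on_ice(time: int, team: str, penalties: Iterable[Penalty],
                   full: int = 5, minimum: int = 3) -> int:
    """Number of skaters `team` has at `time`, given the penalties so far."""
    active = sum(1 for p in penalties
                 if p.team == team and p.start <= time < p.start + p.duration)
    # A team never plays with fewer than three skaters
    return max(full - active, minimum)


def strength(time: int, acting_team: str, other_team: str,
             penalties: Iterable[Penalty]) -> str:
    """On-ice strength from the point of view of the team executing the event."""
    own = skaters_on_ice(time, acting_team, penalties)
    opp = skaters_on_ice(time, other_team, penalties)
    return f"{own}v{opp}"
```

With a single two-minute penalty against one team starting at 100 seconds, a shot by the other team at 150 seconds is labelled 5v4, and any event from 220 seconds onward is back to 5v5.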

3.3. Additional features

Real-time performance analysis enables a detailed examination of both team and player behaviors during a game. I will incorporate three metrics for each team, calculated from the start of the game up to each event:

  • Goals per Shot: Calculated as the number of goals divided by the number of shots, this metric gauges scoring efficiency.
  • Saves per Shot: Determined by dividing the number of saves made by the number of shots faced, this metric assesses goaltender performance.
  • Faceoff Win Rate: Calculated as the number of faceoffs won divided by the total number of faceoffs that have occurred, this metric provides insight into a team’s control over puck possession.
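
The three metrics can be sketched as running counts updated event by event. The event layout below is an assumption made for this example: each event is a dictionary with a 'type' of 'SHOT' (a saved shot on goal), 'GOAL' or 'FACEOFF', and a 'team' that took the shot, scored, or won the faceoff. The names ratio and running_metrics are hypothetical, and a metric is None until its denominator is non-zero.

```python
from collections import Counter
from typing import Optional


def ratio(numerator: int, denominator: int) -> Optional[float]:
    """Safe division: None while the denominator is still zero."""
    return numerator / denominator if denominator else None


def running_metrics(events: list, home: str, away: str) -> list:
    """Return, for each event, both teams' metrics from game start up to it."""
    saved = Counter()         # shots on goal by each team that were saved
    goals = Counter()         # goals scored by each team
    faceoffs_won = Counter()  # faceoffs won by each team
    rows = []
    for event in events:
        team, kind = event["team"], event["type"]
        if kind == "SHOT":
            saved[team] += 1
        elif kind == "GOAL":
            goals[team] += 1
        elif kind == "FACEOFF":
            faceoffs_won[team] += 1

        row = {}
        for t, opp in ((home, away), (away, home)):
            shots_for = saved[t] + goals[t]
            shots_faced = saved[opp] + goals[opp]
            total_faceoffs = faceoffs_won[home] + faceoffs_won[away]
            row[t] = {
                "goals_per_shot": ratio(goals[t], shots_for),
                # The opponent's saved shots are this team's saves
                "saves_per_shot": ratio(saved[opp], shots_faced),
                "faceoff_win_rate": ratio(faceoffs_won[t], total_faceoffs),
            }
        rows.append(row)
    return rows
```

Each returned row lines up with one event, so the values could be attached to the tidy dataframe as extra columns.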

4. Simple Visualizations

5. Advanced Visualizations: Shot Maps

5.1 Shot Maps

  • Shot Maps for 2016
  • Shot Maps for 2017
  • Shot Maps for 2018
  • Shot Maps for 2019
  • Shot Maps for 2020

5.2 Discussion

From these plots, you can infer the playstyles of different teams in a given season. Zones of excess shots (darkest red) show where a team typically shoots from and whether that is close to the goal or not. You can also see which side a team favors, which might be influenced by whether its shooters are right- or left-handed, for example. Looking at the overall picture, you can also draw conclusions about the average shot rate: a map that is blue or red across the board indicates that the team shoots, on average, less or more than the league average, respectively. With these figures for multiple seasons, you can also track how the playstyles of different teams and of the league as a whole evolve over the years.
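
To make "excess shots" concrete, the sketch below bins shot coordinates into square cells and subtracts the league rate from a team's rate. It is a simplified illustration and not the code behind the figures: it normalizes per game, applies no smoothing, and the names binned_rate and excess_shot_rate and the 5-foot cell size are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Cell = Tuple[int, int]


def binned_rate(shots: Iterable[Tuple[float, float]], games: int,
                cell: float = 5.0) -> Dict[Cell, float]:
    """Shots per game in each cell of a square grid (cell size in feet)."""
    counts = Counter((int(x // cell), int(y // cell)) for x, y in shots)
    return {c: n / games for c, n in counts.items()}


def excess_shot_rate(team_shots: Iterable[Tuple[float, float]], team_games: int,
                     league_shots: Iterable[Tuple[float, float]],
                     league_team_games: int,
                     cell: float = 5.0) -> Dict[Cell, float]:
    """Team rate minus league rate per cell.

    Positive values correspond to red areas (more shots than the league
    average), negative values to blue areas (fewer shots).
    league_team_games is the number of team-games behind league_shots,
    so the league rate is an average per team per game.
    """
    team = binned_rate(team_shots, team_games, cell)
    league = binned_rate(league_shots, league_team_games, cell)
    return {c: team.get(c, 0.0) - league.get(c, 0.0)
            for c in set(team) | set(league)}
```

Because every value is a difference from that season's league rate, maps from different seasons are each relative to their own baseline.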

5.3 Consider the Colorado Avalanche; take a look at their shot map during the 2016-17 season. Discuss what you could say about the team during this season. Now look at the shot map for the Colorado Avalanche for the 2020-21 season, and discuss what you could conclude from these differences. Does this make sense? Hint: look at the standings.

Upon examining the two shot maps, we can see that the Colorado Avalanche team was significantly more active in the 2020-2021 season compared to the 2016-2017 season. In the 2016-2017 season, the team was notably more active on the left side of the offensive zone, but they shot less than the league average in the middle of the offensive zone. In the 2020-2021 season, they were slightly less active near the goal but much more engaged in the middle, with a broad region of red between 20 and 60 feet from the center of the rink. During the 2020-2021 season, the Colorado Avalanche finished first in the league, while they ended up 30th in the 2016-2017 season. What appears to be a change in playstyle, characterized by an increase in shots from further out, seems to have contributed to a better standing. However, these observations must be taken with caution because we are comparing the team indirectly against the league average for those seasons. So, we cannot be entirely certain that the team shot more in 2020-2021, only that they shot more than that year’s specific average.

5.4 Consider the Buffalo Sabres, which have been a team that has struggled over recent years, and compare them to the Tampa Bay Lightning, a team which has won the Stanley Cup for two years in a row. Look at the shot maps for these two teams from the 2018-19, 2019-20, and 2020-21 seasons. Discuss what observations you can make. Is there anything that could explain the Lightning’s success, or the Sabres’ struggles? How complete of a picture do you think this paints?

Comparing this many detailed maps at once is challenging. If we compare the teams season by season, it appears that, on average, the Tampa Bay Lightning have a higher shot rate than the Buffalo Sabres. In the plot for 2018-2019, for instance, we observe that Tampa Bay had a higher average shot rate in the two faceoff circles closest to the goal. Another observation is that the Lightning seem to shoot more from the right side, suggesting that having a strong right winger might contribute to their success. We can also see that neither team shoots much from right in front of the net, for example on tip-ins (blue area around the goal for each season), but except for the 2019-2020 season, this is much more marked for the Buffalo Sabres. When creating these figures and selecting smoothing parameters, we opted to retain a certain level of noise for a more faithful representation of the data, although this makes the maps slightly harder to interpret at first glance. Even so, these shot maps definitely do not provide a complete picture. We understand that the distance to the goal correlates with the goal percentage, but the shot type and numerous other factors are also influential. A map of goal rate would be more informative than a map of shots because, ultimately, goals determine the game’s outcome, and a team that shoots frequently with low accuracy won’t excel. We also need to consider the defensive zone, as a well-rounded team excels in both offense and defense.